
    Boosting in Cox regression: a comparison between the likelihood-based and the model-based approaches with focus on the R-packages CoxBoost and mboost

    Despite the limitations imposed by the proportional hazards assumption, the Cox model is probably the most popular statistical tool used to analyze survival data, thanks to its flexibility and ease of interpretation. For this reason, novel statistical/machine learning techniques are usually adapted to fit it, including boosting, an iterative technique originally developed in the machine learning community and later extended to the statistical field. The popularity of boosting has been further driven by the availability of user-friendly software such as the R packages mboost and CoxBoost, both of which allow the implementation of boosting in conjunction with the Cox model. Despite the common underlying boosting principles, these two packages use different techniques: the former is an implementation of model-based boosting, while the latter implements likelihood-based boosting. Here we contrast these two boosting techniques as implemented in the R packages from an analytic point of view, and we examine the solutions they adopt to treat mandatory variables, i.e. variables that for some reason must be included in the model. We explore the possibility of extending solutions currently implemented in only one package to the other. We illustrate the usefulness of these extensions through applications to two real data examples.
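    As a rough illustration of the model-based (gradient) boosting principle that mboost applies to the Cox model, the following Python sketch runs componentwise least-squares boosting on the negative Cox partial log-likelihood. It is a toy analogue only, not the algorithm of either R package; the step length nu, the number of iterations, and the assumption of standardized covariates and no tied event times are all simplifications.

```python
import numpy as np

def cox_negative_gradient(time, status, eta):
    """Gradient of the Cox partial log-likelihood with respect to the linear
    predictor eta (numpy arrays; no handling of tied event times)."""
    exp_eta = np.exp(eta)
    grad = status.astype(float).copy()
    for i in np.where(status == 1)[0]:
        at_risk = time >= time[i]                    # risk set at the i-th event time
        grad[at_risk] -= exp_eta[at_risk] / exp_eta[at_risk].sum()
    return grad

def componentwise_cox_boosting(X, time, status, n_steps=100, nu=0.1):
    """Model-based boosting with componentwise linear base learners:
    at each step only the single best-fitting covariate is updated."""
    n, p = X.shape                                   # X assumed standardized
    beta = np.zeros(p)
    for _ in range(n_steps):
        u = cox_negative_gradient(time, status, X @ beta)      # working response
        slopes = (X * u[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
        rss = ((u[:, None] - X * slopes) ** 2).sum(axis=0)
        j = np.argmin(rss)                           # best univariate base learner
        beta[j] += nu * slopes[j]                    # weak (shrunken) update
    return beta
```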

    Clustering via nonparametric density estimation: an application to microarray data.

    Cluster analysis is a crucial tool in several biological and medical studies dealing with microarray data. Such studies pose challenging statistical problems due to dimensionality issues, since the number of variables is much larger than the number of observations. Here, we present a novel approach to clustering of microarray data via nonparametric density estimation, based on the following steps: (i) selection of relevant variables; (ii) dimensionality reduction; (iii) clustering of observations in the reduced space. Applications to simulated and real data show promising results in comparison with those produced by two standard approaches, k-means and Mclust. In the simulation studies, our nonparametric approach shows performance comparable to that of models based on the normality assumption, even in Gaussian settings. On the other hand, on two real benchmark datasets, it outperforms the existing parametric approaches.
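    The three-step pipeline (variable selection, dimensionality reduction, clustering in the reduced space) can be mimicked with standard tools. The sketch below uses variance-based screening, PCA, and mean-shift clustering as a stand-in for nonparametric density-based clustering; the paper's actual selection criterion and density estimator may differ, and n_keep and n_components are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MeanShift

def density_based_clustering(X, n_keep=200, n_components=2):
    """(i) keep the n_keep most variable genes, (ii) reduce with PCA,
    (iii) cluster in the reduced space with a density-based method."""
    variances = X.var(axis=0)
    selected = np.argsort(variances)[-n_keep:]                         # step (i): screening
    Z = PCA(n_components=n_components).fit_transform(X[:, selected])   # step (ii): reduction
    labels = MeanShift().fit_predict(Z)                                # step (iii): modes of a KDE
    return labels
```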

    A U-statistic estimator for the variance of resampling-based error estimators

    We revisit resampling procedures for error estimation in binary classification in terms of U-statistics. In particular, we exploit the fact that the error rate estimator involving all learning-testing splits is a U-statistic. Therefore, several standard theorems on properties of U-statistics apply: the estimator has minimal variance among all unbiased estimators and is asymptotically normally distributed. Moreover, there is an unbiased estimator for this minimal variance if the total sample size is at least twice the learning set size plus two. In this case, we exhibit such an estimator, which is itself a U-statistic. It enjoys, again, various optimality properties and yields an asymptotically exact hypothesis test of the equality of error rates when two learning algorithms are compared. Our statements apply to any deterministic learning algorithm under weak non-degeneracy assumptions. In an application to tuning parameter choice in lasso regression on a gene expression data set, the test does not reject the null hypothesis of equal error rates for two different parameter values.
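    A toy version of the error estimator that averages over all learning-testing splits (the U-statistic discussed above) can be written as follows; enumerating all splits is feasible only for very small samples, and the classifier and learning set size g are placeholders. The unbiased variance estimator described in the abstract additionally requires the total sample size to be at least 2g + 2 and is not reproduced here.

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression

def complete_split_error(X, y, g):
    """Average misclassification rate over all learning sets of size g,
    each tested on the complementary observations (a U-statistic)."""
    n = len(y)
    errors = []
    for learn in combinations(range(n), g):
        test = np.setdiff1d(np.arange(n), learn)
        clf = LogisticRegression(max_iter=1000).fit(X[list(learn)], y[list(learn)])
        errors.append(np.mean(clf.predict(X[test]) != y[test]))
    return np.mean(errors)
```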

    Modelling publication bias and p-hacking

    Publication bias and p-hacking are two well-known phenomena that strongly affect the scientific literature and cause severe problems in meta-analyses. Due to these phenomena, the assumptions of meta-analyses are seriously violated and the results of the studies cannot be trusted. While publication bias is almost perfectly captured by the weighting-function selection model, p-hacking is much harder to model and no definitive solution has been found yet. In this paper we propose to model both publication bias and p-hacking with selection models. We derive some properties for these models, and we compare them formally and through simulations. Finally, two real data examples are used to show how the models work in practice.
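    As one concrete instance of a weighting-function selection model for publication bias (a simple one-step, Hedges-type model; the paper's actual models, in particular those for p-hacking, are not reproduced here), the sketch below defines the negative log-likelihood for normally distributed effect estimates in which non-significant results are published with relative probability w. The parameterization with a single mean theta and known standard errors is an assumption of the sketch.

```python
import numpy as np
from scipy import stats, optimize

def neg_loglik(params, y, s, alpha=0.05):
    """One-step selection model: one-sided significant results (p < alpha) are
    always published, non-significant ones with relative probability w."""
    theta, w = params
    c = stats.norm.ppf(1 - alpha) * s                # significance cut-offs on the estimate scale
    dens = stats.norm.pdf(y, loc=theta, scale=s)     # sampling density of the estimates
    weight = np.where(y > c, 1.0, w)                 # publication weight of each study
    p_sig = 1 - stats.norm.cdf(c, loc=theta, scale=s)
    norm_const = p_sig + w * (1 - p_sig)             # probability of being published
    return -np.sum(np.log(weight * dens / norm_const))

# Hypothetical usage, with y = observed effects and s = their standard errors:
# fit = optimize.minimize(neg_loglik, x0=[0.0, 0.5], args=(y, s),
#                         bounds=[(None, None), (1e-6, 1.0)])
```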

    Predicting time to graduation at a large enrollment American university

    The time it takes a student to graduate with a university degree is influenced by a variety of factors such as their background, their academic performance at university, and their integration into the social communities of the university they attend. Different universities have different populations, student services, instruction styles, and degree programs; however, they all collect institutional data. This study presents data for 160,933 students attending a large American research university. The data include performance, enrollment, demographic, and preparation features. Discrete-time hazard models for the time to graduation are presented in the context of Tinto's Theory of Drop Out. Additionally, gradient boosted trees, a machine learning method, are applied and compared to the typical maximum likelihood approach. We demonstrate that enrollment factors (such as changing major) lead to greater gains in predicting when a student graduates than performance factors (such as grades) or preparation factors (such as high school GPA).
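    The discrete-time hazard setup can be reproduced by expanding each student into one row per term at risk and fitting any binary classifier to the event indicator. The sketch below uses pandas and scikit-learn's gradient boosting as a stand-in for the paper's models; the column names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def person_period(df, time_col="terms_to_grad", event_col="graduated"):
    """Expand one row per student into one row per term at risk, with a
    binary indicator for graduating in that term (censored students get 0s)."""
    rows = []
    for _, r in df.iterrows():
        for t in range(1, int(r[time_col]) + 1):
            row = r.to_dict()
            row["term"] = t
            row["event"] = int((t == r[time_col]) and (r[event_col] == 1))
            rows.append(row)
    return pd.DataFrame(rows)

# Hypothetical usage; "hs_gpa" and "major_changes" are placeholder features:
# long = person_period(students)
# X = long[["term", "hs_gpa", "major_changes"]]
# model = GradientBoostingClassifier().fit(X, long["event"])   # discrete-time hazard
```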

    Integrated likelihoods in models with stratum nuisance parameters

    Inference about a parameter of interest in the presence of a nuisance parameter can be based on an integrated likelihood function. We analyze the behaviour of inferential quantities based on such a pseudo-likelihood in a two-index asymptotics framework, in which both the sample size and the dimension of the nuisance parameter may diverge to infinity. We show that the integrated likelihood, if chosen wisely, largely outperforms standard likelihood methods, such as the profile likelihood. These results are confirmed by simulation studies, in which comparisons with the modified profile likelihood are also considered.
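    The classical Neyman-Scott example (normal pairs sharing a stratum-specific mean) illustrates the phenomenon the abstract refers to: with many nuisance means, the profile-likelihood estimate of the variance is inconsistent, while integrating the stratum means out with a flat weight restores consistency. The simulation below is this textbook special case, not the paper's general framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 5000, 2.0                        # many strata, two observations each
mu = rng.normal(size=n)                      # stratum-specific nuisance means
y = mu[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(n, 2))

d2 = (y[:, 0] - y[:, 1]) ** 2
profile_est = d2.sum() / (4 * n)             # profile MLE: converges to sigma2 / 2
integrated_est = d2.sum() / (2 * n)          # integrated likelihood (flat weight on mu): consistent

print(profile_est, integrated_est)           # roughly 1.0 vs 2.0 for sigma2 = 2
```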

    Added predictive value of omics data: specific issues related to validation illustrated by two case studies

    In recent years, the importance of independent validation of the predictive ability of a new gene signature has been widely recognized. With the development of gene signatures that integrate rather than replace the clinical predictors in the prediction rule, the focus has shifted to validating the added predictive value of a gene signature, i.e. to verifying that including the new gene signature in a prediction model improves its predictive ability. The high-dimensional nature of the data from which a new signature is derived raises challenging issues and necessitates modifying classical methods to adapt them to this framework. Here we show how to validate the added predictive value of a signature derived from high-dimensional data and critically discuss the impact of the choice of method on the results. The analysis of the added predictive value of two gene signatures developed in two recent studies on the survival of leukemia patients allows us to illustrate and empirically compare different validation techniques in the high-dimensional framework.
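    One common way to assess added predictive value, used here purely as an illustration of the idea (the study above compares several validation strategies), is to fit a clinical-only model and a clinical-plus-signature model on the training data and compare their discrimination on the independent validation set. The sketch below uses the Python lifelines package and the C-index; column names are placeholders and lifelines is a swap-in, not the tooling of the original analyses.

```python
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def added_value_cindex(train, valid, clinical, signature,
                       time_col="time", event_col="event"):
    """Validation C-index of a clinical-only Cox model versus a model that
    also includes the gene signature score(s)."""
    results = {}
    for label, covars in [("clinical", clinical),
                          ("clinical+signature", clinical + signature)]:
        cph = CoxPHFitter().fit(train[covars + [time_col, event_col]],
                                duration_col=time_col, event_col=event_col)
        risk = cph.predict_partial_hazard(valid[covars])
        # higher hazard should mean shorter survival, hence the minus sign
        results[label] = concordance_index(valid[time_col], -risk, valid[event_col])
    return results
```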

    Influence of single observations on the choice of the penalty parameter in ridge regression

    Penalized regression methods, such as ridge regression, heavily rely on the choice of a tuning, or penalty, parameter, which is often computed via cross-validation. Discrepancies in the value of the penalty parameter may lead to substantial differences in regression coefficient estimates and predictions. In this paper, we investigate the effect of single observations on the optimal choice of the tuning parameter, showing how the presence of influential points can dramatically change it. We characterize influential points as "expanders" or "shrinkers", based on their effect on model complexity. Our approach supplies a visual exploratory tool to identify influential points, naturally implementable for high-dimensional data where traditional approaches usually fail. Applications to real data examples, both low- and high-dimensional, and a simulation study are presented.
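    A crude, brute-force way to probe the effect of single observations on the cross-validated penalty (only an approximation of the diagnostic described above) is to refit the cross-validated ridge path with each observation left out and record how the selected penalty moves.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def loo_penalty_shift(X, y, alphas=np.logspace(-3, 3, 50)):
    """For each observation, refit RidgeCV without it and record the change
    in the selected penalty relative to the full-data fit."""
    full_alpha = RidgeCV(alphas=alphas).fit(X, y).alpha_
    shifts = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        shifts[i] = RidgeCV(alphas=alphas).fit(X[mask], y[mask]).alpha_ - full_alpha
    return shifts   # large |shift|: candidate influential point
```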

    A Machine Learning Approach to Safer Airplane Landings: Predicting Runway Conditions using Weather and Flight Data

    The presence of snow and ice on runway surfaces reduces the available tire-pavement friction needed for retardation and directional control and poses potential economic and safety threats for the aviation industry during the winter seasons. To activate appropriate safety procedures, pilots need accurate and timely information on the actual runway surface conditions. In this study, XGBoost is used to create a combined runway assessment system, which includes a classification model to predict slippery conditions and a regression model to predict the level of slipperiness. The models are trained on weather data and data from runway reports. The runway surface conditions are represented by the tire-pavement friction coefficient, which is estimated from flight sensor data from landing aircraft. To evaluate the performance of the models, they are compared to several state-of-the-art runway assessment methods. The XGBoost models identify slippery runway conditions with a ROC AUC of 0.95, predict the friction coefficient with a MAE of 0.0254, and outperform all the previous methods. The results show the strong ability of machine learning methods to model complex physical phenomena with good accuracy when domain knowledge is used in the variable extraction. The XGBoost models are combined with SHAP (SHapley Additive exPlanations) approximations to provide a comprehensible decision support system for airport operators and pilots, which can contribute to safer and more economic operations of airport runways.
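    A minimal version of such a combined assessment system could look like the following sketch: an XGBoost classifier for slippery/not slippery, an XGBoost regressor for the friction coefficient, and SHAP values for decision support. The feature matrix, the slipperiness threshold, and the hyperparameters are placeholders, not the study's settings.

```python
import xgboost as xgb
import shap

def fit_runway_models(X_train, y_friction, slippery_threshold=0.2):
    """Classifier for slippery conditions plus regressor for the friction
    coefficient, with SHAP attributions for decision support."""
    y_slippery = (y_friction < slippery_threshold).astype(int)   # placeholder threshold
    clf = xgb.XGBClassifier(n_estimators=300, max_depth=4).fit(X_train, y_slippery)
    reg = xgb.XGBRegressor(n_estimators=300, max_depth=4).fit(X_train, y_friction)
    explainer = shap.TreeExplainer(reg)
    shap_values = explainer.shap_values(X_train)                 # per-feature contributions
    return clf, reg, shap_values
```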

    Non-parametric Bayesian modeling of cervical mucus symptom

    Analysis of the cervical mucus symptom is useful for identifying a woman's period of maximum fertility. In this paper we analyze the daily evolution of the cervical mucus symptom during the menstrual cycle, based on data collected in two retrospective studies in which the mucus symptom is treated as an ordinal variable. To produce our statistical model, we follow a non-parametric Bayesian approach. In particular, we use the idea of non-parametric mixtures of rounded continuous kernels, recently proposed in the literature to deal with categorical functional data. By fitting the model, we identify the typical pattern of the mucus symptom during the menstrual cycle, i.e. a slow increase in fertility until ovulation followed by a steep decrease to a state less favorable to fecundation. From the results, it is possible to extract useful information to predict the beginning of the most fertile period and, where relevant, to identify possible physio-pathological conditions. As a by-product of our analysis, we are able to group the menstrual cycles based on differences in the daily evolution of the cervical mucus symptom. This division may help in the identification of cycles with particular characteristics.
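    The "rounding" idea behind mixtures of rounded continuous kernels can be pictured by thresholding a latent smooth trajectory into ordinal categories. The sketch below only simulates that data-generating mechanism with arbitrary cut-points and an arbitrary latent curve; it does not implement the Bayesian nonparametric mixture itself.

```python
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(1, 29)                                   # one menstrual cycle
# latent continuous symptom intensity, peaking near ovulation (day 14)
latent = np.exp(-0.5 * ((days - 14) / 3.0) ** 2) + 0.1 * rng.normal(size=days.size)
cuts = np.array([0.2, 0.5, 0.8])                          # arbitrary rounding thresholds
mucus_score = np.digitize(latent, cuts)                   # observed ordinal scores 0..3
```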